Overview of the NLPCC-ICCPOL 2016 Shared Task: Chinese Word Segmentation for Micro-Blog Texts
Abstract
In this paper, we give an overview of the shared task at the 5th CCF Conference on Natural Language Processing & Chinese Computing (NLPCC 2016): Chinese word segmentation for micro-blog texts. Unlike the widely used newswire datasets, the dataset of this shared task consists of relatively informal micro-texts. In addition, we use a new psychometric-inspired evaluation metric for Chinese word segmentation, which aims to balance the highly skewed word distribution across different levels of difficulty. The data and evaluation code can be downloaded from https://github.com/FudanNLP/NLPCC-WordSeg-Weibo.
Related resources
A Feature-Rich CRF Segmenter for Chinese Micro-Blog
This paper describes our system for Chinese word segmentation of micro-blog text, one of the NLPCC-ICCPOL 2016 Shared Tasks [1]. A CRF (Conditional Random Field) model is employed to treat word segmentation as a sequence labeling problem, and 7 sets of features are selected to train the CRF model. The system achieves an fb of 0.798144 on the closed track, 0.81968 on the semi-open track, and 0.82217 on the open tra...
Overview of the NLPCC 2015 Shared Task: Chinese Word Segmentation and POS Tagging for Micro-blog Texts
Word segmentation and Part-of-Speech (POS) tagging are two fundamental tasks for Chinese language processing. In recent years, word segmentation and POS tagging have undergone great development. The popular method is to regard these two tasks as sequence labeling problems, which can be handled with supervised learning algorithms such as Conditional Random Fields (CRF) [1]. However, the performanc...
Word Segmentation on Micro-Blog Texts with External Lexicon and Heterogeneous Data
This paper describes our system designed for the NLPCC 2016 shared task on word segmentation on micro-blog texts (i.e., Weibo). We treat word segmentation as a character-wise sequence labeling problem, and explore two directions to enhance our CRF-based baseline. First, we employ a large-scale external lexicon for constructing extra lexicon features in the model, which is proven to be extremely...
Exploring Various Linguistic Features for Stance Detection
In this paper, we describe our participation in the fourth shared task (NLPCC-ICCPOL 2016 Shared Task 4) on stance detection in Chinese micro-blogs (subtask A). Going beyond ordinary features, we explore four kinds of linguistic features in Chinese micro-blogs for our stance classifier, including lexical, morphology, semantic, and syntax features, and get a good performance, wh...
Overview of the NLPCC-ICCPOL 2016 Shared Task: Open Domain Chinese Question Answering
In this paper, we give an overview of the open domain Question Answering (or open domain QA) shared task at NLPCC-ICCPOL 2016. We first review the background of QA, and then describe two open domain Chinese QA tasks in this year’s NLPCC-ICCPOL, including the construction of the benchmark datasets and the evaluation metrics. The evaluation results of submissions from participating teams are...
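Several of the systems listed above treat word segmentation as a character-wise sequence labeling problem. The following is a minimal sketch of that framing, not any participant's implementation: it assumes the common B/M/E/S tag set (begin, middle, end, single), which the abstracts do not specify, and the function names `words_to_tags` and `tags_to_words` are hypothetical. A sequence model such as a CRF would be trained to predict these per-character tags.

```python
from typing import List


def words_to_tags(words: List[str]) -> List[str]:
    """Map a segmented sentence to one B/M/E/S tag per character.

    The B/M/E/S scheme is an assumption; the source abstracts do not
    name the tag set they use.
    """
    tags: List[str] = []
    for word in words:
        if not word:
            continue  # skip empty tokens
        if len(word) == 1:
            tags.append("S")  # single-character word
        else:
            # first character, zero or more middle characters, last character
            tags.extend(["B"] + ["M"] * (len(word) - 2) + ["E"])
    return tags


def tags_to_words(chars: str, tags: List[str]) -> List[str]:
    """Rebuild the segmentation from characters and their predicted tags."""
    words: List[str] = []
    current = ""
    for ch, tag in zip(chars, tags):
        current += ch
        if tag in ("E", "S"):  # a word ends at this character
            words.append(current)
            current = ""
    if current:  # flush a trailing word left open by an ill-formed tag sequence
        words.append(current)
    return words


if __name__ == "__main__":
    gold = ["我", "喜欢", "自然语言"]
    tags = words_to_tags(gold)
    print(tags)
    print(tags_to_words("".join(gold), tags))
```

Converting to tags and back recovers the original segmentation, which is why segmentation quality can be evaluated on the words decoded from a tagger's output.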